Deep Learning meets Semantic Web: A feasibility study with the Cardiovascular Disease Ontology and PubMed citations
نویسندگان
چکیده
Background: Automatic identification of gene and protein names from biomedical publications can help curators and researchers to keep up with the findings published in the scientific literature. As of today, this is a challenging task related to information retrieval, and in the realm of Big Data Analytics. Objectives: To investigate the feasibility of using word embed-dings (i.e. distributed word representations) from Deep Learning algorithms together with terms from the Cardiovascular Disease Ontology (CVDO) as a step to identifying omics information encoded in the biomedical literature. Methods: Word embeddings were generated using the neural language models CBOW and Skip-gram with an input of more than 14 million PubMed citations (titles and abstracts) corresponding to articles published between 2000 and 2016. Then the abstracts of selected papers from the sysVASC systematic review were manually annotated with gene/protein names. We set up two experiments that used the word embeddings to produce term variants for gene/protein names: the first experiment used the terms manually annotated from the papers; the second experiment enriched/expanded the annotated terms using terms from the human-readable labels of key classes (gene/proteins) from the CVDO ontology. CVDO is formalised in the W3C Web Ontology Language (OWL) and contains 172,121 UniProt Knowledgebase protein classes related to human and 86,792 UniProtKB protein classes related to mouse. The hypothesis is that by enriching the original annotated terms, a better context is provided, and therefore, it is easier to obtain suitable (full and/or partial) term variants for gene/protein names from word embed-dings. Results: From the papers manually annotated, a list of 107 terms (gene/protein names) was acquired. As part of the word embeddings generated from CBOW and Skip-gram, a lexicon with more than 9 million terms was created. Using the cosine similarity metric, a list of the 12 top-ranked terms was generated from word embeddings for query terms present in the generated lexicon. Domain experts evaluated a total of 1968 pairs of terms and classified the retrieved terms as: TV (term variant); PTV (partial term variant); and NTV (non term variant, meaning none of the previous two categories). In experiment I, Skip-gram finds the double amount of (full and/or partial) term variants for gene/protein names as compared with CBOW. Using Skip-gram, the weighted Cohen's Kappa inter-annotator agreement for two domain experts was 0.80 for the first experiment and 0.74 for the second experiment. In the first experiment, suitable (full and/or partial) term variants were found …
منابع مشابه
Development of a Combined System Based on Data Mining and Semantic Web for the Diagnosis of Autism
Introduction: Autism is a nervous system disorder, and since there is no direct diagnosis for it, data mining can help diagnose the disease. Ontology as a backbone of the semantic web, a knowledge database with shareability and reusability, can be a confirmation of the correctness of disease diagnosis systems. This study aimed to provide a system for diagnosing autistic children with a combinat...
متن کاملDevelopment of a Combined System Based on Data Mining and Semantic Web for the Diagnosis of Autism
Introduction: Autism is a nervous system disorder, and since there is no direct diagnosis for it, data mining can help diagnose the disease. Ontology as a backbone of the semantic web, a knowledge database with shareability and reusability, can be a confirmation of the correctness of disease diagnosis systems. This study aimed to provide a system for diagnosing autistic children with a combinat...
متن کاملDeveloping a BIM-based Spatial Ontology for Semantic Querying of 3D Property Information
With the growing dominance of complex and multi-level urban structures, current cadastral systems, which are often developed based on 2D representations, are not capable of providing unambiguous spatial information about urban properties. Therefore, the concept of 3D cadastre is proposed to support 3D digital representation of land and properties and facilitate the communication of legal owners...
متن کاملAn Executive Approach Based On the Production of Fuzzy Ontology Using the Semantic Web Rule Language Method (SWRL)
Today, the need to deal with ambiguous information in semantic web languages is increasing. Ontology is an important part of the W3C standards for the semantic web, used to define a conceptual standard vocabulary for the exchange of data between systems, the provision of reusable databases, and the facilitation of collaboration across multiple systems. However, classical ontology is not enough ...
متن کاملQuery Architecture Expansion in Web Using Fuzzy Multi Domain Ontology
Due to the increasing web, there are many challenges to establish a general framework for data mining and retrieving structured data from the Web. Creating an ontology is a step towards solving this problem. The ontology raises the main entity and the concept of any data in data mining. In this paper, we tried to propose a method for applying the "meaning" of the search system, But the problem ...
متن کامل